Mining Heterogeneous Transformations for Record Linkage
Authors
Abstract
Heterogeneous transformations are translations between strings that are not characterized by a single function. For example, nicknames, abbreviations, and synonyms are heterogeneous transformations, while edit distances are not. Such transformations are useful for information retrieval, information extraction, and text understanding. They are especially useful in record linkage, where we determine whether two records refer to the same entity by examining the similarities between their fields. However, heterogeneous transformations are usually created manually, with no assurance that they will be useful. This paper presents a data mining approach that discovers heterogeneous transformations between two data sets without labeled training data. In addition to simple transformations, our algorithm finds combinatorial transformations, such as synonyms and abbreviations applied together. Our experiments demonstrate that we discover many types of specialized transformations, and we show that exploiting these transformations improves record linkage. Our approach makes discovering and exploiting heterogeneous transformations more scalable and robust by reducing their dependence on domain knowledge and human effort.
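To make the distinction in the abstract concrete, the following is a minimal Python sketch, not the paper's mining algorithm. It assumes a small hand-written transformation table (the paper mines such pairs from data instead), and the names `TRANSFORMATIONS`, `char_similarity`, `tokens_match`, and `field_similarity` are hypothetical. It contrasts a character-level similarity score, used here as a stand-in for edit-distance-style measures, with token matching through known transformations when comparing two record fields.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written transformation table for illustration only.
# The paper discovers such pairs by mining two data sets; this sketch does not.
TRANSFORMATIONS = {
    ("robert", "bob"),      # nickname
    ("street", "st"),       # abbreviation
    ("automobile", "car"),  # synonym
}


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for edit-distance-style measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tokens_match(a: str, b: str) -> bool:
    """True if two tokens are equal or related by a known transformation (either direction)."""
    a, b = a.lower(), b.lower()
    return a == b or (a, b) in TRANSFORMATIONS or (b, a) in TRANSFORMATIONS


def field_similarity(x: str, y: str) -> float:
    """Fraction of tokens matched one-to-one, relative to the longer field."""
    xs, ys = x.split(), y.split()
    if not xs or not ys:
        return 0.0
    remaining = list(ys)  # tokens of y not yet paired with a token of x
    matched = 0
    for tok in xs:
        for cand in remaining:
            if tokens_match(tok, cand):
                remaining.remove(cand)  # safe: we break out of the loop right after
                matched += 1
                break
    return matched / max(len(xs), len(ys))
```

Under these assumptions, `char_similarity("Robert", "Bob")` is low because the two strings share few characters, while `field_similarity("Robert Smith", "Bob Smith")` treats the fields as fully matched because the nickname pair is in the table. No single string function captures all three kinds of pair in the table, which is what makes such transformations heterogeneous.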
Similar resources
Mining the Heterogeneous Transformations between Data Sources to Aid Record Linkage
Heterogeneous transformations are translations between strings that are not characterized by a single function. E.g., nicknames, abbreviations and synonyms are heterogeneous transformations while edit distances are not. Such transformations are useful for information retrieval, information extraction and text understanding. They are especially useful in record linkage, where the problem is to d...
Privacy Preserving Record Linkage via grams Projections
Record linkage has been extensively used in various data mining applications involving sharing data. While the amount of available data is growing, the concern of disclosing sensitive information poses the problem of utility vs privacy. In this paper, we study the problem of private record linkage via secure data transformations. In contrast to the existing techniques in this area, we propose a...
MAL4:6 - Using Data Mining for Record Linkage
This paper presents a first attempt at using pedigree-based data to improve record linkage. It describes a composite metric for similarity and a mechanism to extract relevant generational features. Results on a large data set demonstrate promise.
Record Linkage: Current Practice and Future Directions
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federa...
High-Performance Computing Techniques for Record Linkage
The task of linking together information from one or more data sources representing the same entity (patient, customer, provider, business, etc.). If no unique identifier is available, probabilistic linkage techniques have to be applied. Applications of record linkage: remove duplicates in a data set (internal linkage); merge new records into a larger master data set; create patient oriented statist...
Publication date: 2007